-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parquet: don't truncate f16/decimal min/max stats #5154
Parquet: don't truncate f16/decimal min/max stats #5154
Conversation
For reference parquet-mr has similar logic, difference being instead of opting out of truncation for Decimal, they enforce that types need to opt-in: Which has similar effect in ensuring Decimal isn't truncated (Float16 support isn't yet merged in for parquet-mr, but they aren't changing that logic in the Float16 PR so presumably will also not do truncation: apache/parquet-java#1142) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the only place we perform truncation, or do we also need to do something similar for the ColumnChunkMetadata statistics?
Those were the only places that called the truncate min/max functions |
What about in write_column_metadata? |
Statistics::FixedLenByteArray(stats) | ||
if (stats.has_min_max_set() && self.can_truncate_value()) => | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about in write_column_metadata?
This covers the truncation in write_column_metadata(..)
Which issue does this PR close?
Closes #5075
Rationale for this change
All logical types with
BYTE_ARRAY
/FIXED_LEN_BYTE_ARRAY
underlying physical types share the sort order as their physical type, that being unsigned byte-wise comparison, except forFloat16
andDecimal
. Applying the truncation logic to their min/max statistics can therefore lead to inaccurate statistics.What changes are included in this PR?
Add check before truncating min/max statistics to ensure
Decimal
andFloat16
columns don't get truncated.Are there any user-facing changes?